Thursday, May 29, 2008

Computing an unbiased inverse

In a previous entry, Reweighting is biased, I discussed the problem of finite-sample-size bias in reweighting. In this entry I will investigate a possible solution.


But first, I had asked if this was known in the literature. A comment by Lee Warren pointed to this article, Population size bias in descendant-weighted diffusion quantum Monte Carlo simulations, which discusses this very issue (in a slightly different context, but the root issue is the same). In that context the problem is even more severe since the straightforward solution (increase N) carries a heavier computational cost.


One of the crucial issues is finding a good estimator for the inverse (1/D). The naive estimator (average the samples of D and simply do the division) is biased. So the question is: are there other ways of performing division that could be used? Yes - digital arithmetic has a couple of algorithms for division that use only multiplication and addition, Newton-Raphson and Goldschmidt (see the Wikipedia entry on division algorithms).
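
To see the bias numerically, here is a small sketch (illustrative only, not part of the original analysis): it averages a handful of samples of D and divides, over many repeated trials. The distribution of D and the batch size are arbitrary choices for the example.

# Quick illustration of the bias in the naive inverse estimator.
# D is drawn from an arbitrary distribution with E[D] = 2 (chosen just for
# this example); the batch size is kept small so the bias from dividing by
# a noisy average is easy to see.
import random

n_batch = 5            # samples of D averaged per estimate
n_trials = 100000

total = 0.0
for _ in range(n_trials):
    batch = [random.expovariate(0.5) for _ in range(n_batch)]   # E[D] = 2
    d_bar = sum(batch) / n_batch
    total += 1.0 / d_bar          # naive estimator: just do the division
print("naive estimate of 1/E[D]:", total / n_trials)   # comes out noticeably above 0.5
print("exact 1/E[D]:", 0.5)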


In general, for an estimator to be unbiased, each sample of the random variable must enter the estimator linearly (alternatively, each sample should be used only once and then discarded).


The Newton-Raphson iteration for the inverse (1/D) is x_{i+1} = x_i (2 - D x_i). To make this unbiased, the value of D used in each iteration needs to be an independent sample, and the two factors of x_i from the previous iteration also need to be computed from independent samples. One can think of the algorithm as forming a tree, with each iteration branching to two children (each child is an independently computed x_i from the previous iteration), until reaching the leaves, which hold the initial guess x_0 = T_1 + T_2 D.
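
Here is a minimal sketch of this tree-structured iteration as I read it (illustrative code only, not the code I actually ran; the distribution of D, the tree depth, and the constants T1 and T2 are arbitrary choices for the example).

# Sketch of the tree-structured stochastic Newton-Raphson inverse.
import random

def sample_D():
    # Arbitrary example distribution with E[D] = 2, so 1/E[D] = 0.5.
    return random.uniform(1.5, 2.5)

T1, T2 = 0.8, -0.1     # example constants for the initial guess x0 = T1 + T2*D

def unbiased_inverse(depth):
    # One sample of x_depth.  Every recursive branch and every draw of D is
    # independent, so each sample enters the estimator linearly.
    if depth == 0:
        return T1 + T2 * sample_D()          # leaf: x0 = T1 + T2*D
    xa = unbiased_inverse(depth - 1)         # first independent child
    xb = unbiased_inverse(depth - 1)         # second independent child
    return xa * (2.0 - sample_D() * xb)      # x_{i+1} = x_i (2 - D x_i)

samples = [unbiased_inverse(4) for _ in range(50000)]
print(sum(samples) / len(samples))           # should land near 0.5, with a lot of scatter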


I tested this algorithm, and it appears to give unbiased results. There is a catch, though: the result has a large variance. Feeding the average of all the samples into the naive estimator yields a bias that is much smaller than the variance of the unbiased estimator (hence the mean squared error, MSE = bias^2 + variance, is a more appropriate metric for the combined effect of variance and bias).


Open questions/issues:


  • Is there some modification of the algorithm that will produce a smaller MSE than the naive estimator? Or even an MSE that isn't too much larger?

  • How many iterations are necessary - how to analyze the convergence of this stochastic Newton-Raphson iteration?

  • Try out the Goldschmidt algorithm.

Sunday, April 06, 2008

Reproducible Research

There were a couple of recent posts on Greg Wilson's blog about reproducible research. The first is a pointer to a paper about WaveLab. The paper is from 1995, although the philosophy appears not to have changed much, judging from a more recent (2005) introduction on the website. The second post contains a reference to the Madagascar project.


These projects follow the work of Jon Claerbout on Reproducible Research.


Thoughts


  • This work is all about designing a workflow for computational research, and creating tools to support that workflow. Much of it seems quite banal, but part of the whole purpose of doing this is to free you from having to think about the boring and repetitious steps.
  • Levels of reproducibility

    1. Yourself, right away (when tweaking figures, for instance)

    2. Another research group member, or yourself 6 months later

    3. Someone in another group, (a) able to run the code and get the same results, and (b) able to understand the code

    4. Someone in another group, able to use your description to write their own code that gets the same results


    The last level is the gold standard in scientific reproducibility. Experiments are reproduced from published descriptions (ideally, at least - the descriptions aren't always complete), or theorists rederive the intermediate steps leading to a paper's conclusion.

    In software development, levels 1-3 are what count. (It's not very useful software development if others can't run your code and get the expected results.) The last level is similar to multiple implementations of a standard (e.g., C++ compilers from multiple vendors).

  • The referenced works seem to focus mostly on raw input data (images, seismic measurements, etc.) and the processing steps that produce a final result.


    How would this apply in other fields? Say, ones with smaller input data sets, but much higher computational demands.


    The Madagascar project, for example, uses SCons to coordinate the steps in producing research output (a hypothetical SConstruct sketch follows this list). What happens when one of the steps involves several weeks of supercomputer time? Two immediate problems spring to mind: one probably doesn't want that step launched without warning by the script, and the steps for launching such a job are likely to be very site-specific, making it difficult for others to reproduce elsewhere.


  • Somewhat related entries of mine: Keeping notes and Solving Science Problems
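
To make the SCons-driven workflow mentioned above concrete, here is a hypothetical SConstruct fragment (the file names and scripts are made up); each step reruns only when its inputs change, which is exactly what becomes awkward when one step is a multi-week supercomputer job.

# Hypothetical SConstruct sketch of a Madagascar-style processing chain.
# Each Command() declares one step; running `scons` re-executes a step only
# when its source files have changed.
Command('filtered.dat', 'raw.dat', 'python filter.py $SOURCE $TARGET')
Command('result.dat', 'filtered.dat', 'python analyze.py $SOURCE $TARGET')
Command('figure.pdf', 'result.dat', 'python plot.py $SOURCE $TARGET')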

Wednesday, March 19, 2008

Notes from PyCon 2008

I attended PyCon 2008 this past weekend, and had a good time meeting people in the Python community.


Some notes


  • I tried to give a lightning (5 minute) talk about sympy and code generation. Unfortunately, Ubuntu 7.10 on my laptop (Dell Vostro) failed to drive the external video output, so I was unable to present to the larger audience. However, I was able to give the talk to the people at the science BOF session. The slides are available here.

  • At the science BOF session I met Michael Tobis, the author of an interesting system called PyNSol. It uses Python to describe the differential equations and generates Fortran for performance.

  • The Python 2-to-3 converter looks interesting - not so much for converting between Python versions as for the infrastructure developed for the tool.

  • Enthought was showing off some multi-touch displays. Being able to interact with 3D visualizations (and Google Earth) via gestures for zoom, pan, and rotation made for a nice demo. The real question is whether this type of display would be useful in day-to-day work.

Monday, March 17, 2008

Code generation now uses templates

In Quameon, the derivatives of the Gaussian basis functions are computed with sympy and the Python code is generated. Creating the code that surrounds the generated code has been a bit of a tedious procedure - the syntax tree needed to be constructed by hand. But no more! The surrounding code can now be written in Python, and a parser (written using pyparsing) will create the syntax tree for you.


Some details:

Identifiers surrounded by percent signs (e.g., %name%) get turned into a template node type for subsequent replacement.


Example template:


class %orb_name%:
    def compute_value(self, alpha, v, r2):
        x = v[0]
        y = v[1]
        z = v[2]
        %value%
        return val


The code replaces %orb_name% with the orbital name (s, px, py, pz, etc.) and %value% with the expression for the value of the orbital (the methods for the derivatives are not shown - see codegen/primitive_gaussian/deriv.py in SVN for the full template).
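
As a much-simplified illustration of the idea (the real code builds a syntax tree with pyparsing rather than doing plain text substitution), filling the template could look like the following sketch; the %value% expression used here is just a made-up placeholder.

# Simplified sketch of template filling using plain string substitution.
# (The actual Quameon code parses the template and works on a syntax tree.)
template = """class %orb_name%:
    def compute_value(self, alpha, v, r2):
        x = v[0]
        y = v[1]
        z = v[2]
        %value%
        return val
"""

def fill_template(text, replacements):
    # Replace each %name% placeholder with the corresponding code fragment.
    for name, code in replacements.items():
        text = text.replace('%' + name + '%', code)
    return text

# Hypothetical generated expression for a px orbital.
print(fill_template(template, {'orb_name': 'px',
                               'value': 'val = x * exp(-alpha * r2)'}))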

Friday, December 21, 2007

Why python?

The author of the first comment on the Quameon entry wonders if Python will be too slow.


One reason to use Python is ease of programming. Hopefully the advantages of programming in Python will outweigh the performance penalty.


However, at some point, runtime speed will be an issue. Here's a list of steps to improve runtime performance (in increasing order of effort and expected return):


  1. Use psyco. Put "import psyco; psyco.full()" at the beginning of your program and get an instant speed increase.
  2. Use the Shed-Skin Python to C++ compiler. (I haven't tried this yet - don't know if it works with Quameon.)
  3. Profile and rewrite slow parts in C/C++/Fortran. I've done this with the inter-particle distance computation in a classical MC code in Python, and gained some performance. (A minimal profiling sketch follows this list.)
  4. Code generation. Right now, the code for the Gaussian basis functions is auto-generated (no more computing derivatives by hand!!). I would like to move more of the basic formulas to this technique. Then it should be easier to retarget the code generation to another language (in theory, I haven't tried it yet).
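
For the profiling part of step 3, the standard library's cProfile is enough to find the slow spots; here is a minimal sketch (run_vmc is a placeholder for whatever top-level function you want to time).

# Minimal profiling sketch using the standard library's cProfile module.
import cProfile
import pstats

cProfile.run('run_vmc()', 'profile.out')          # profile one call, save the stats
stats = pstats.Stats('profile.out')
stats.sort_stats('cumulative').print_stats(10)    # show the 10 most expensive calls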


Other tidbits:

Monte Python uses Python and C++ for QMC (paper, code). They wrote the time-consuming parts in C++ and use Python to glue it all together. The algorithm variants they tested (parallel distribution strategies for DMC) were implemented by changing only the Python part of the code.


Code generation could potentially result in *faster* code than writing a general framework in C++ or Fortran. This is because the code can be specialized for a given physical system, exposing more optimization opportunities. For an example of code generation, look at the development SVN area of PyQuante - there is a program that will produce a C++ program to compute the energy for a molecule given in the input file.
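
As a toy illustration of that kind of specialization (this is not the PyQuante generator itself), sympy can expand an expression for one fixed set of basis parameters and emit flat C code with the constants folded in; the exponents below are made-up numbers.

# Toy example of generating specialized code with sympy: the Gaussian
# exponents are fixed at generation time, so the emitted C expression has no
# loop over basis functions and no runtime parameter lookups.
import sympy

x = sympy.Symbol('x')
alphas = [0.5, 1.2, 2.3]                       # hypothetical fixed exponents
expr = sum(sympy.exp(-a * x**2) for a in alphas)
print(sympy.ccode(expr))                       # print the specialized C expression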

Wednesday, December 05, 2007

Quameon

I've been working on a Quantum Monte Carlo code implemented in Python. It's called "Quameon" (the name is a mixture of letters from "Quantum Monte Carlo in Python"). There's a SourceForge project for it - here are links to the SourceForge project page, the project web page, and the project wiki.

The code doesn't do much currently, but I mention the project now so I can write blog entries about various aspects of getting the code working.


I'm currently trying to validate that the code works by using HF single-particle orbitals (obtained from PyQuante) and no Jastrow factor - this should reproduce the HF energy. It seems to work for a few atoms (He, Be, C), after sorting out some normalization issues with the basis sets (apparently not all basis sets are normalized - this doesn't affect the energy, but it does affect the eigenvector coefficients). I will need longer runs to be sure. The next step is to profile the code to look for any quick and easy performance tuning opportunities.

The first target (after getting the code running reliably) is to implement several forms of Jastrow factor and several VMC parameter optimization algorithms.

Thursday, October 04, 2007

Keeping notes

Keeping a record of work is important for research. I use two main tools for this purpose. First, a text file with entries in chronological order (one file per month, with names like "Sept2007"). I write the date and then start taking notes for that day in the file.

  • Pro - easy to use, chronological view easily available
  • Con - math hard to write well, ideas spread over several days hard to view topically.


The other tool I use is a wiki. It's useful to keep data in a structured, hierarchical format that's easy to edit.

  • Pro - ideas arranged topically, can do better looking math
  • Con - must work topically from the start, no good chronological view (Yes, I know you can view changes in order, but it's not as easy to reconstruct what I was thinking on a particular day.)


I want a single tool that does both of these - that can view the notes both chronologically and by topic.
For work-flow, I want the ability to enter ideas and notes free-form, and perform categorization and annotation at a separate time.

Another feature that would be nice is recording other actions on the system (at least by reference).
One example is check-ins to source control.

Another possible example: what if gnuplot recorded all the commands issued (and kept a copy of all the files used)? Graphs could then easily be re-created and replayed at a later date.

I looked around a little bit to see if there are other applications that might meet some of these requirements (or at least be an improvement over text files and a wiki). One class of application is the note-taking app. I looked at Tomboy. It does have a LaTeX plugin (though I could not get it to work). LaTeX side-rant: it's great for making nice-looking mathematics, but you wind up with "equations under glass" - pretty to look at, but impossible to manipulate further. For mathematical content, I want something with more precise semantics (Content MathML, for example).

Another class of tool is mind-mapping software. I looked at Vym (View Your Mind) briefly, but I don't think I will like this type of software. Viewing a graph of links is nice, but not as the main working view - I prefer a wiki view (pages with links).